Goto

Collaborating Authors

 base rate



Elementary, My Dear Watson: Non-Invasive Neural Keyword Spotting in the LibriBrain Dataset

arXiv.org Artificial Intelligence

Non-invasive brain-computer interfaces (BCIs) are beginning to benefit from large, public benchmarks. However, current benchmarks target relatively simple, foundational tasks like Speech Detection and Phoneme Classification, while application-ready results on tasks like Brain-to-Text remain elusive. We propose Keyword Spotting (KWS) as a practically applicable, privacy-aware intermediate task. Using the deep 52-hour, within-subject LibriBrain corpus, we provide standardized train/validation/test splits for reproducible benchmarking, and adopt an evaluation protocol tailored to extreme class imbalance. Concretely, we use area under the precision-recall curve (AUPRC) as a robust evaluation metric, complemented by false alarms per hour (FA/h) at fixed recall to capture user-facing trade-offs. To simplify deployment and further experimentation within the research community, we are releasing an updated version of the pnpl library with word-level dataloaders and Colab-ready tutorials. As an initial reference model, we present a compact 1-D Conv/ResNet baseline with focal loss and top-k pooling that is trainable on a single consumer-class GPU. The reference model achieves approximately 13x the permutation baseline AUPRC on held-out sessions, demonstrating the viability of the task. Exploratory analyses reveal: (i) predictable within-subject scaling - performance improves log-linearly with more training hours - and (ii) the existence of word-level factors (frequency and duration) that systematically modulate detectability.


Quantum Causality: Resolving Simpson's Paradox with $\mathcal{DO}$-Calculus

arXiv.org Artificial Intelligence

Distinguishing correlation from causation is a fundamental challenge in machine intelligence, often representing a critical barrier to building robust and trustworthy systems. While Pearl's $\mathcal{DO}$-calculus provides a rigorous framework for causal inference, a parallel challenge lies in its physical implementation. Here, we apply and experimentally validate a quantum algorithmic framework for performing causal interventions. Our approach maps causal networks onto quantum circuits where probabilistic links are encoded by controlled-rotation gates, and interventions are realized by a structural remodeling of the circuit -- a physical analogue to Pearl's ``graph surgery''. We demonstrate the method's efficacy by resolving Simpson's Paradox in a 3-qubit model, and show its scalability by quantifying confounding bias in a 10-qubit healthcare simulation. Critically, we provide a proof-of-principle experimental validation on an IonQ Aria quantum computer, successfully reproducing the paradox and its resolution in the presence of real-world noise. This work establishes a practical pathway for quantum causal inference, offering a new computational tool to address deep-rooted challenges in algorithmic fairness and explainable AI (XAI).




PhreshPhish: A Real-World, High-Quality, Large-Scale Phishing Website Dataset and Benchmark

arXiv.org Artificial Intelligence

Phishing remains a pervasive and growing threat, inflicting heavy economic and reputational damage. While machine learning has been effective in real-time detection of phishing attacks, progress is hindered by lack of large, high-quality datasets and benchmarks. In addition to poor-quality due to challenges in data collection, existing datasets suffer from leakage and unrealistic base rates, leading to overly optimistic performance results. In this paper, we introduce PhreshPhish, a large-scale, high-quality dataset of phishing websites that addresses these limitations. Compared to existing public datasets, PhreshPhish is substantially larger and provides significantly higher quality, as measured by the estimated rate of invalid or mislabeled data points. Additionally, we propose a comprehensive suite of benchmark datasets specifically designed for realistic model evaluation by minimizing leakage, increasing task difficulty, enhancing dataset diversity, and adjustment of base rates more likely to be seen in the real world. We train and evaluate multiple solution approaches to provide baseline performance on the benchmark sets. We believe the availability of this dataset and benchmarks will enable realistic, standardized model comparison and foster further advances in phishing detection. The datasets and benchmarks are available on Hugging Face (https://huggingface.co/datasets/phreshphish/phreshphish).


BiasICL: In-Context Learning and Demographic Biases of Vision Language Models

arXiv.org Artificial Intelligence

Vision language models (VLMs) show promise in medical diagnosis, but their performance across demographic subgroups when using in-context learning (ICL) remains poorly understood. We examine how the demographic composition of demonstration examples affects VLM performance in two medical imaging tasks: skin lesion malignancy prediction and pneumothorax detection from chest radiographs. Our analysis reveals that ICL influences model predictions through multiple mechanisms: (1) ICL allows VLMs to learn subgroup-specific disease base rates from prompts and (2) ICL leads VLMs to make predictions that perform differently across demographic groups, even after controlling for subgroup-specific disease base rates. Our empirical results inform best-practices for prompting current VLMs (specifically examining demographic subgroup performance, and matching base rates of labels to target distribution at a bulk level and within subgroups), while also suggesting next steps for improving our theoretical understanding of these models.


Reviews: Theoretical Comparisons of Positive-Unlabeled Learning against Positive-Negative Learning

Neural Information Processing Systems

The basic problem studied in the paper concerns learning from data which is only partially labelled, but nonetheless doing better than with fully labelled data. In the particular scenario where one has a pool of unlabelled data in lieu of one of the classes, the paper seeks to quantify the impact this has on the estimation error of the learned classifier. Determining the degradation or lack thereof when learning from unlabelled data is interesting, and thus the paper seems well motivated. The machinery used to illustrate its messages are fairly standard -- the key quantities, namely the estimation errors for each scenario, are derived from a simple Rademacher analysis -- however, the final results appear novel, with implications worked through in various scenarios. The papers hinges on the simple facts that (a) different risks (for PN/PU/NU learning) may be seen as employing different weightings on individual risks for the positive and negative class, and (b) these ratings are reflected in appropriate terms for the estimation error.


Probabilistic adaptation of language comprehension for individual speakers: Evidence from neural oscillations

arXiv.org Artificial Intelligence

Listeners adapt language comprehension based on their mental representations of speakers, but how these representations are dynamically updated remains unclear. We investigated whether listeners probabilistically adapt their comprehension based on the likelihood of speakers producing stereotype-incongruent utterances. Our findings reveal two potential mechanisms: a speaker-general mechanism that adjusts overall expectations about speaker-content relationships, and a speaker-specific mechanism that updates individual speaker models. In two EEG experiments, participants heard speakers make stereotype-congruent or incongruent utterances, with incongruency base rate manipulated between blocks. In Experiment 1, speaker incongruency modulated both high-beta (21-30 Hz) and theta (4-6 Hz) oscillations: incongruent utterances decreased oscillatory power in low base rate condition but increased it in high base rate condition. The theta effect varied with listeners' openness trait: less open participants showed theta increases to speaker-incongruencies, suggesting maintenance of speaker-specific information, while more open participants showed theta decreases, indicating flexible model updating. In Experiment 2, we dissociated base rate from the target speaker by manipulating the overall base rate using an alternative non-target speaker. Only the high-beta effect persisted, showing power decrease for speaker-incongruencies in low base rate condition but no effect in high base rate condition. The high-beta oscillations might reflect the speaker-general adjustment, while theta oscillations may index the speaker-specific model updating. These findings provide evidence for how language processing is shaped by social cognition in real time.


Dual Traits in Probabilistic Reasoning of Large Language Models

arXiv.org Artificial Intelligence

We conducted three experiments to investigate how large language models (LLMs) evaluate posterior probabilities. Our results reveal the coexistence of two modes in posterior judgment among state-of-the-art models: a normative mode, which adheres to Bayes' rule, and a representative-based mode, which relies on similarity -- paralleling human System 1 and System 2 thinking. Additionally, we observed that LLMs struggle to recall base rate information from their memory, and developing prompt engineering strategies to mitigate representative-based judgment may be challenging. We further conjecture that the dual modes of judgment may be a result of the contrastive loss function employed in reinforcement learning from human feedback. Our findings underscore the potential direction for reducing cognitive biases in LLMs and the necessity for cautious deployment of LLMs in critical areas. The remarkable advancements in large language models (LLMs) have ushered in a new era where these models rival human expertise across domains like academia, law, medicine, and finance [4, 12, 13, 22-24]. In this study, we explore how LLMs judge this posterior probability. A higher similarity corresponds to a higher assessed posterior probability. This study comprises three experiments with progressively stricter conditions, reducing the information available for posterior likelihood assessment. The structured test provides all information needed for normative judgment, the semi-structured test omits the diagnosticity of evidence, and the unstructured test requires LLMs to recall all components of Bayes' rule. Results reveal that LLMs' judgments shift from f Representativeness can be constructed through typicality or prototypicality. Typicality describes the common or average case of the class, whereas prototypicality embodies the most idealized and iconic version of the class. For instance, a typical example of a physicist is a smart man who likes math and physics, while a prototypical example of a physicist is Stephen Hawking. This study moves beyond bias detection to investigate the basis upon which LLMs assess probabilities. This has important practical implications for the integration of LLMs into various critical fields.